A Software Implemented Fault-tolerance Layer for Reliable Computing on Massively Parallel Computers and Distributed Computing Systems

Authors

Gavin D. Holland

Dhiraj K. Pradhan

Abstract

A novel architecture for a software-implemented fault-tolerance layer for application reliability on massively parallel computers and distributed computing systems is proposed. This is the first attempt at providing a purely software-based, user-level solution for fault detection, reconfiguration, and recovery in a parallel environment. The symmetrically distributed, multi-tiered layer envelops user applications, enabling it to perform fault-tolerance-related actions apart from, and transparently to, the application. Its modular design enables dynamic runtime selection of the most appropriate fault-tolerant algorithm, so the layer is not restricted to one particular fault-tolerant method. Performance and coverage measurements of a minimal implementation of the proposed layer are presented; they indicate that user-level software-implemented fault tolerance is realistically doable and reasonably efficient and effective.

Similar resources

Improving the palbimm scheduling algorithm for fault tolerance in cloud computing

Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...

An approach to fault detection and correction in design of systems using of Turbo ‎codes‎

We present an approach to design of fault tolerant computing systems. In this paper, a technique is employed that enable the combination of several codes, in order to obtain flexibility in the design of error correcting codes. Code combining techniques are very effective, which one of these codes are turbo codes. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...

On Integrating Error Detection into a Fault Diagnosis Algorithm for Massively Parallel Computers

Scalable fault diagnosis is necessary for constructing fault tolerance mechanisms in large massively parallel multiprocessor systems. The diagnosis algorithm must operate efficiently even if the system consists of several thousand processors. In this paper we introduce an event-driven, distributed system-level diagnosis algorithm. It uses a small number of messages and is based on a general dia...

Supervised Workpools for Reliable Massively Parallel Computing

The manycore revolution is steadily increasing the performance and size of massively parallel systems, to the point where system reliability becomes a pressing concern. Therefore, massively parallel compute jobs must be able to tolerate failures. For example, in the HPCGAP project we aim to coordinate symbolic computations in architectures with 10 cores. At that scale, failures are a real issue...

A Distributed Object Based Framework for Parallel Computations

The computational and compositional features are very important while constructing parallel software for the workstation clusters. However, lack of suitable supporting environment for parallel software development makes most existing distributed parallel software systems very weak in these two aspects, especially in the compositional feature. In this paper, a distributed object based framework ...

Publication date: 2007
